
Conversation

luke-lin-vmc (Contributor) commented on Sep 23, 2025

Whisper sample code to enable model caching on GPU and NPU

This is a follow-up to #2751.

Sample Code Reference:
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/encrypted_model_vlm.py#L87
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/cpp/text_generation/encrypted_model_causal_lm.cpp#L52

OPTIMIZE_SIZE and encryption are not included. The main performance concern for Whisper is pipeline speed. Since Whisper is much smaller than LLMs, size optimization offers only marginal savings while potentially adding latency, and model encryption can likewise introduce additional latency.
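
Both samples call a get_config_for_cache() helper (visible in the diffs below) that is not expanded in this thread. As a minimal sketch, assuming the helper follows the referenced encrypted-model samples and only sets the cache directory (the "whisper_cache" folder name is illustrative, not taken from this PR), it could look like:

#include "openvino/openvino.hpp"

// Sketch of an assumed helper shape, not the merged code: setting ov::cache_dir
// makes OpenVINO serialize the compiled GPU/NPU model to disk on the first run
// and reload it on subsequent runs.
ov::AnyMap get_config_for_cache() {
    ov::AnyMap config;
    config.insert(ov::cache_dir("whisper_cache"));
    return config;
}

With only CACHE_DIR set, the first run still pays the full compile cost; the benefit shows up on later runs, which is the scenario that matters most for GPU and NPU startup time.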

ov::AnyMap ov_config;
if (device == "NPU" || device.find("GPU") != std::string::npos) { // need to handle cases like "GPU", "GPU.0" and "GPU.1"
// Cache compiled models on disk for GPU and NPU to save time on the
// next run. It's not beneficial for CPU.
A collaborator commented:

Why is it not beneficial for CPU?

luke-lin-vmc (Contributor, Author) replied on Sep 23, 2025:

  1. This comment is simply copied from the reference sample code.
  2. AFAIK the CPU plugin's "compile" step is mostly graph rewrites and primitive selection. It typically takes milliseconds to a few hundred milliseconds, not the seconds to minutes seen on GPU/NPU.
  3. Most importantly, enabling model caching on CPU causes the Whisper pipeline to crash. This looks like a bug that needs further investigation, so for now model caching is enabled only on GPU and NPU to avoid the issue.

Wovchena requested a review from Copilot on October 16, 2025 11:24
Copilot AI left a comment:

Pull Request Overview

Adds GPU/NPU model caching configuration to Whisper speech recognition sample code.

  • Introduces helper to build caching config in both Python and C++ samples.
  • Applies conditional logic to enable caching only on GPU/NPU devices.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Reviewed files:
  • samples/python/whisper_speech_recognition/whisper_speech_recognition.py: adds a cache config helper and conditionally passes CACHE_DIR to WhisperPipeline.
  • samples/cpp/whisper_speech_recognition/whisper_speech_recognition.cpp: adds a cache config helper and conditionally passes the AnyMap to WhisperPipeline.


Comment on lines +26 to +31
ov_config = dict()
if args.device == "NPU" or "GPU" in args.device: # need to handle cases like "GPU", "GPU.0" and "GPU.1"
# Cache compiled models on disk for GPU and NPU to save time on the
# next run. It's not beneficial for CPU.
ov_config = get_config_for_cache()

Copilot AI commented on Oct 16, 2025:

The condition handles GPU variants (e.g. GPU.0) but will skip NPU variants such as 'NPU.0', limiting caching despite the PR goal to enable it for NPU. Update the condition to also match NPU suffixed forms, e.g.: if 'GPU' in args.device or args.device.startswith('NPU'):. Alternatively use substring checks for both: if 'GPU' in args.device or 'NPU' in args.device:.


Comment on lines +22 to +29
ov::AnyMap ov_config;
if (device == "NPU" || device.find("GPU") != std::string::npos) { // need to handle cases like "GPU", "GPU.0" and "GPU.1"
// Cache compiled models on disk for GPU and NPU to save time on the
// next run. It's not beneficial for CPU.
ov_config = get_config_for_cache();
}

ov::genai::WhisperPipeline pipeline(models_path, device, ov_config);
Copilot AI commented on Oct 16, 2025:

The condition enables caching for GPU variants but misses NPU variants like 'NPU.0', restricting caching contrary to the stated intent. Adjust to also detect NPU substrings: if (device.find("GPU") != std::string::npos || device.find("NPU") != std::string::npos) { ... }.
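
Applied to the C++ snippet above, the reviewer's proposal would read roughly as follows (a sketch of the suggestion, not necessarily what was merged):

ov::AnyMap ov_config;
// Substring checks cover suffixed device strings such as "GPU.0", "GPU.1" and "NPU.0".
if (device.find("GPU") != std::string::npos || device.find("NPU") != std::string::npos) {
    // Cache compiled models on disk for GPU and NPU to save time on the next run.
    ov_config = get_config_for_cache();
}

ov::genai::WhisperPipeline pipeline(models_path, device, ov_config);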


Wovchena (Collaborator) commented:
build_jenkins

Wovchena enabled auto-merge October 16, 2025 13:02
Wovchena added this pull request to the merge queue Oct 16, 2025
github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Oct 16, 2025
Wovchena added this pull request to the merge queue Oct 16, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 17, 2025
Wovchena added this pull request to the merge queue Oct 17, 2025
Merged via the queue into openvinotoolkit:master with commit ad2dd5f Oct 17, 2025
92 checks passed
AsyaPronina pushed a commit to eshiryae/openvino.genai that referenced this pull request Oct 22, 2025

Labels

category: Whisper samples (GenAI Whisper samples)

3 participants